Track per-phase duration on each instance by sjmiller609 · Pull Request #223 · kernel/hypeman

sjmiller609 · 2026-05-11T18:57:09Z

Summary

Adds a small phasetracking package under lib/instances/phasetracking that records how long each instance has spent in each lifecycle phase (running, standby, paused, stopped, etc.). Tracker state lives on StoredMetadata, so it persists with the existing metadata.json per instance — no schema migration.

Instrumentation is done directly at the transition sites in create/start/stop/standby/restore/fork. We intentionally do not subscribe to the existing lifecycle event stream — that pipeline is lossy and these numbers will feed billing, so we want a direct write at the point of transition.

Each Instance API response now carries:

current_phase
current_phase_since
phase_durations_ms (cumulative ms per phase, with the live phase counted up through now)

Why

Today the billing pipeline that consumes hypeman instances charges based on wall-clock since CreatedAt, which double-counts intra-session standby and any other non-running time. To fix that properly we need hypeman itself to report real per-phase durations so the upstream billing calculation can compute true running time. Once this ships, the kernel-api side can switch the formula for hypeman CPU back to a platform-uptime model.

Test plan

go build ./...
go test ./lib/instances/phasetracking/...
go test -run TestInstanceToOAPI ./cmd/api/api/...
Manually exercise an instance through start → standby → restore → stop and confirm phase_durations_ms matches expectations

Note

Medium Risk
Touches instance lifecycle transition paths (create/start/stop/standby/restore/fork) and adds new API fields used for billing, so incorrect phase recording could impact cost/analytics calculations despite being additive and well-tested.

Overview
Adds persistent per-instance lifecycle phase accounting via new lib/instances/phasetracking tracker, recording cumulative time spent in phases at each externally observable transition (including special handling to advance initializing→running based on boot markers).

Exposes this data in the Instances API as current_phase, current_phase_since, and phase_durations_ms (snapshotting live time through response time), and updates OpenAPI/generated oapi models accordingly.

Adjusts fork behavior to reset phase history (and deep-clone metadata to avoid shared maps), and expands unit/integration tests to validate phase accrual across create/standby/restore/fork and API emission/omission rules.

Reviewed by Cursor Bugbot for commit 3a95db6.

Add a small phasetracking package that records cumulative time spent in each lifecycle phase (running, standby, paused, etc.) using transition bookkeeping. Tracker state is persisted with the instance's stored metadata, so it survives process restarts without a DB migration. Instrument transitions directly in create/start/stop/standby/restore/fork rather than subscribing to the lifecycle event stream — the subscription is lossy, and these numbers will feed billing. Expose current_phase, current_phase_since, and phase_durations_ms on the Instance API so callers (notably the kernel-api billing pipeline) can compute true running time instead of wall-clock since CreatedAt.

github-actions · 2026-05-11T18:57:45Z

✱ Stainless preview builds for hypeman

This PR will update the hypeman SDKs with the following commit message.

feat: Track per-phase duration on each instance

✅ hypeman-openapi studio · code

Your SDK build had at least one "note" diagnostic.
generate ✅

⚠️ hypeman-typescript studio · code

Your SDK build had a failure in the lint CI job, which is a regression from the base state.
generate ✅ → build ✅ → lint ❗ → test ✅

✅ hypeman-go studio · code

Your SDK build had at least one "note" diagnostic.
generate ✅ → build ⏭️ → lint ✅ → test ✅
go get github.com/stainless-sdks/hypeman-go@312c26f13a0c22a91b902b51c13a431452dca79d

This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
If you push custom code to the preview branch, re-run this workflow to update the comment.
Last updated: 2026-05-12 12:40:56 UTC

Piggyback on the firecracker/QEMU standby-restore cycles and the cloud-hypervisor fork-from-running test to assert end-to-end that transition-site instrumentation is wired up:

- after standby: Current == standby, Cumulative[running] > 0
- after restore: Current == running, Cumulative[standby] > 0
- after fork-from-running: fork's Cumulative[running] is zero while source's is non-zero, which locks down the Phases.Reset() semantics

No new tests, no added sleeps. The assertions read state at the same points where the tests already check State.

firetiger-agent · 2026-05-11T19:47:02Z

Monitoring Plan Created

This PR introduces a new phasetracking package that records cumulative wall-clock time in each lifecycle phase (running, standby, stopped, etc.) for every instance. The tracker is embedded in StoredMetadata, updated at every state transition (create, standby, restore, start, stop, fork), and exposed via three new optional fields on all instance API responses: current_phase, current_phase_since, and phase_durations_ms.

The change is purely additive — no existing fields or logic are modified — and pre-existing instance metadata gracefully degrades (zero-value tracker starts accumulating on the first state transition after deploy). The main risks to watch are metadata write failures during state transitions (which could leave on-disk metadata stale) and any disruption to the standby/restore cycle caused by the new Record() calls. Current baselines show 100% hypeman spawn success rate with zero failed invocations over the past 48 hours, deployment p50 of 16–35s, and stable instance creation error counts.

Status updates will be posted automatically on this PR as monitoring progresses.

The phase tracker's Since field is persisted and exposed in the API as current_phase_since. standby/stop were initializing `now` as local time while create/start/restore use UTC, leaving downstream consumers with mixed timezone offsets in the serialized value depending on which transition last occurred. Align all transition sites on UTC. StoppedAt moves to UTC as a byproduct, which is the correct normalization anyway.

cloneStoredMetadata previously shallow-copied the Phases tracker, which aliased the Cumulative map between source and forked metadata — a subsequent Record on either side would mutate both. Add Tracker.Clone and use it from cloneStoredMetadata. Also normalise the fork transition timestamp to UTC for consistency with the other transition sites.

The recording sites previously jumped straight to PhaseRunning the moment the VMM was up, but the public State machine stays in Initializing until both ProgramStartedAt and GuestAgentReadyAt are hydrated from the guest serial log. That meant Phases.Current reported "running" while the API reported "Initializing". Make phase tracking honest:

- create/start record PhaseInitializing on VM boot
- restore inspects the preserved markers and records whichever phase the guest is actually in (Running in the common case)
- hydrateBootMarkersFromLogs / persistBootMarkers detect the Initializing → Running boundary and Record(PhaseRunning) using the later marker timestamp, so the accrued Initializing duration matches real guest boot time rather than the wall clock when hydration ran

Transient internal substates (Paused/Shutdown inside Standby/Stop) remain unrecorded; they're sub-ms blips inside non-yielding orchestration that no external observer can see.

After a restore from early-standby (an instance put into standby before its boot markers were ever hydrated), Phases.Since is set at restore time. The markers parsed afterwards can carry timestamps from the pre-standby boot session, predating Since by the entire standby interval. Without the clamp, Record would silently skip the negative-elapsed accrual but still move Since backwards, and every subsequent transition would then over-count Running. Since this field feeds billing, clamp forward so Since is monotonic. Adds a regression test covering the early-standby restore path.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit a29fd64.

sjmiller609 added 2 commits May 11, 2026 19:00

gofmt phasetracking

86dcffe

sjmiller609 marked this pull request as ready for review May 11, 2026 19:40

cursor Bot reviewed May 11, 2026

Comment thread lib/instances/standby.go

cursor Bot reviewed May 11, 2026

Comment thread lib/instances/fork.go

Comment thread lib/instances/fork.go

sjmiller609 requested a review from hiroTamada May 11, 2026 20:35

sjmiller609 added 2 commits May 11, 2026 21:19

Merge remote-tracking branch 'origin/main' into hypeship/instance-pha…

6490aa7

…se-tracking # Conflicts: # lib/instances/query_test.go

hiroTamada reviewed May 11, 2026

Comment thread lib/instances/phasetracking/phasetracking.go

hiroTamada reviewed May 11, 2026

Comment thread lib/instances/start.go Outdated

sjmiller609 added 2 commits May 11, 2026 21:49

phasetracking: drop unused PhasePaused/PhaseShutdown constants

a29fd64

They were never recorded — internal Paused/Shutdown substates happen inside non-yielding orchestration calls and are intentionally not tracked (already documented in the package doc).

sjmiller609 requested a review from hiroTamada May 11, 2026 22:32

cursor Bot reviewed May 11, 2026

Comment thread lib/instances/phasetracking/phasetracking.go Outdated

cursor Bot reviewed May 11, 2026

Comment thread lib/instances/query.go

hiroTamada approved these changes May 12, 2026

sjmiller609 merged commit 23e332a into main May 12, 2026
11 checks passed

sjmiller609 deleted the hypeship/instance-phase-tracking branch May 12, 2026 12:39

sjmiller609 merged 9 commits into main from hypeship/instance-phase-tracking